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Abstract 

The Buridan's ass paradox is characterized by perpetual indecision between two 
states, which are never attained. When this problem is formulated as a dynamical 
system, indecision is modeled by a discrete-state Markov process determined by the 
system's unknown parameters. Interest lies in estimating these parameters from a 
limited number of observations. We compare estimation methods and examine how 
well each can be generalized to multi-dimensional extensions of this system. By quan- 
tifying statistics such as mean, variance, frequency, and cumulative power, we con- 
struct both method of moments type estimators and likelihood-based estimators. We 
show, however, why these techniques become intractable in higher dimensions, and 
thus develop a geometric approach to reveal the parameters underlying the Markov 
process. We also examine the robustness of this method to the presence of noise. 
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1 Introduction 



In the classical paradox known as "Buridan's Ass," first posed in the 14th century by Jean 
Buridan, an ass is placed midway between a pail of water and a bale of hay. Assuming 
the donkey prefers the closer object, he theoretically never decides to approach one or 
the other when placed equidistant between the two (Lamport, 2012). This problem can 
be generalized as a dynamical system in which the donkey is free to roam between the 
two objects, nevertheless changing his preference before ever actually reaching one or the 
other. In this formulation, the donkey's indecision is modeled by a discrete-state Markov 
process, the state being the donkey's current preference. The parameters governing this 
Markov process are the probabilities of the donkey switching that preference. We begin by 
exploring the mathematical generalization of the classical, two-state paradox with the goal 
of estimating these parameters. We consider both method of moments type estimators and 
likehhood-based estimators to achieve this goal. We also develop an algorithm that allows 
us to directly estimate the parameters. 

We arc interested in the behavior of the system in higher dimensions, so we also consider 
the donkey in a triangular pen. By adding one more state to the system and extending 
the problem from one to two dimensions, many of the methods we explore for the donkey 
on a line become intractable. Nevertheless, our algorithm, which detects the donkey's 
state at each unit time step, still allows us to measure the parameters of the system. In 
the absence of measurement noise, this state detector gives us the most reasonable values 
for our parameters. In the presence of noise, though, the state detector performs poorly. 
Consequently, we employ methods of denoising the system before applying state detection. 



2 Donkey on a Line 

Following the setup of Buridan's Ass, we place the donkey on a line between a pail of water 
Bit X — and a bale of hay at x — 1. We define state as the state in which the donkey is 
moving towards the pail of water, and state 1 as the state in which the donkey is moving 
towards the bale of hay.^ The donkey slows as he comes closer to either object, so he never 
actually reaches the water or the hay. The donkey's movement is modeled by the following 
differential equations: 

State 0: ^ = v(-x) (1) 

State 1: — = vil -x). (2) 
dt ^ ' ^ ' 

The constant v controls the speed of the donkey. While the donkey obeys these differential 
equations in a continuous fashion, we allow him to change state only at unit time steps. We 



^For simplicity, the donkey is always initially placed at a: = 0.5 and is taken to be in state 0. 
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let Toi be the probability that the donkey switches from state to state 1 at any given time 
step, and Tiq be the probability that the donkey switches from state 1 to state 0. These 
arc the parameters we estimate in our mathematical model of Buridan's Ass. We assume 
we know the constant v (even if not known, it is easily measured), and we observe only the 
donkey's position at each of finitely many time steps. Notice that the parameters Tqi and Tiq 
change with neither time nor the donkey's current state, imparting them the independence 
that characterizes the stochastic switching of this system as a Markov process. 



2.1 Method of Moments Type Estimators 

One way to estimate Tqi and Tiq is the method of moments approach. The standard 
method of moments estimation scheme uses moments of measured data, such as mean 
and variance, to estimate unknown parameters (Hansen, 1982). The approach relies on 
deriving algebraic expressions for these moments in terms of the parameters. When these 
expressions are invertible, they give closed-form estimates of parameters in terms of only 
measured data. Furthermore, this technique can be used with feature statistics such as 
frequency and cumulative power. Since the donkey on a line system has two parameters, 
we require couplets, or pairs of invertible expressions, to estimate tqi and tiq. 



2.1.1 The Markov Process 

The probability distribution of the donkey's state at the nth time step is given by the 
discrete-state Markov chain 



(3) 



where 



1-Toi Tio 
Tqi l-Tio 



(4) 



and Pq is a vector whose entries are the initial probabilities of the donkey being in either 
state. Notice that A is a column stochastic matrix (one whose columns each have entries 
summing to 1). Any such matrix has an eigenvector associated with an eigenvalue of 1 
(see Appendix B). It is a well-known fact that under certain conditions^, this eigenvector 
is unique^ and represents a stable distribution to which any initial distribution vector will 



^The Markov chain must be recurrent, aperiodic, and irreducible (Orey, 1962). A recurrent Markov 

chain is one in which any state, once reached, will with probability 1 be reached again. Apcriodicity is the 
condition that for each state i, gcd{n : state i can be returned to in n steps} = 1. Finally, an irreducible 
chain is one in which any state can somehow be reached from any other state. 

■^Up to multiplication by a constant. For the eigenvector v to make sense as a probability distribution, 
it must be normalized so that its entries sum to 1. 
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converge after repeated applications of (3). For A, this eigenvector is 



■ Tip ■ 

Toi + Tio 

■ Toi + Tio. 



(5) 



This vector shows that the expected proportion of time the donkey spends in state is 
— ^^7^ — , and the expected proportion of time he spends in state 1 is — ^ — . These proportions 
motivate the conditional probability tree in Figure 1. 




State 



1 - Toi ^ State 



'^01^ State 1 



no 



State 



Toi + Tio State 1 



1 — Ti(P~~^ State 1 



(1 - Toi)tio 
Toi + Tio 

ToiTio 
Toi + Tie 

TpiTio 
Toi + Tio 

(1 - Tio) Toi 



Toi + Tio 

Figure 1: Conditional probability tree for donkey on a line 

The first two branches of the tree show the marginal probabilities of the donkey being 
in either state, as given by the eigenvector v. The second layer of branches identifies the 
conditional probabilities of the donkey staying in his current state or transitioning to the 
other. These probabilities come simply from the problem's construction. By the inde- 
pendence of the Markov chain, we can multiply connected branches to produce the final, 
joint probabilities of the tree. For example, the probability of the donkey being in state 



and then switching to state 1 



IS 



TOl+r 10 ' 



Notice this is equal to the probability of the 



donkey being in state 1 and then switching to state 0. Therefore, we define our frequency 
of transition as 



ToiTio 
Toi + Tio 



(6) 



We may think of a; as the probability of observing some state transition at a given time 
step. This is our first statistic for use in a method of moments couplet. 
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2.1.2 Continuous Dynamics 

Although the donkey's state is discrete, his position is continuous in both space and time. 
We let P{x, t) be the probability density of the donkey's position, x, at time t. Note that 
for fixed t, P{x, t) is a probability density of one variable. Furthermore, since the donkey 
cannot simultaneously be in both states, we can express P as the sum of state-dependent, 
conditional probabilities,*^ Po{x,t) and Pi{x,t): 

P{x,t)^Po{x,t) + Pi{x,t). (7) 

Over time, the probability associated with state is advected towards and the proba- 
bility associated with state 1 is advected towards 1. Probabihty is also transferred among 
states by the switching probabilities, tqi and tiq. These changes in probability density are 
captured by the conservation conditions 

dP d 

-Qf = -Q^ H-X)P0{X, t)] - ToiPoix, t) + noPlix, t) (8) 

dP d 

-^ = -—[v{l-x) P,{X, t)] + ToiPoix, t) - TioPl(x, t). (9) 

We are interested in the long term behavior of the system, and thus we look for time- 
invariant steady-state solutions. That is, we examine the case in which 

dPo dPi , , 

and we can, therefore, drop the dependence of Pq and P\ on t. Furthermore, we anticipate 
that in a steady-state solution, total fluxes are balanced: 

f (-x)Po(j-) = v{\ - x)Pi{x) (11) 
Po{x) = —Pi{x). (12) 

X 

The steady-state condition in state 1 is now 

= [v (1 - x) P,{x)] + Toi^Pi(x) - noPlix). (14) 



^By referring to Pq and Pi as probabilities, we do not mean that their integrals over their support equal 

1, but rather that they sum to a valid probability density function. If properly normalized, Pq and Pi 
would be the probability densities of the donkey's position given his being in state or state 1, respectively. 
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Applying the derivative operator in (14) and rearranging terms, we have 

1 — X 



dPi 

-V (1 - x) + vPi{x) 



-Toi- 



-PAx) + TioPi{.x) 



X 

dPi 1 — X 

V{l-X) —— = Toi Pl{x) - TioPl(x) + vPi{x) 



dx 
dPi 



X 



1 



a =Toi Pl{x) 

ox vx 



Tio- 



-Pi{x) + 



(15) 
(16) 

-Pi(x). (17) 



V (1 — x) ""^ ^ ' 1 — X 

This is a separable first-order ordinary differential equation. Using separation of variables 
we get 



-dPi = 



Toi 1 Tio 1 , 1 1 a 

+ - r dx 

V X V 1 — X [1 — X) 

— In (x) + — In (1 - x) - In (1 - x) + X 

Cx ^ [1 — x) " 



P,(x) 
ln(Pi(a;)) 

Pi{x) 

Substituting Pi{x) into (12), we find that 

Poix) ^ Cx^-^-^ {1 - xf-^ . 
We then combine Pq and Pi, yielding the probability distribution 

P{x) = Po{x) + Pi{x) 



Cxv^{l-xP 



no I 



Thus, the constant 



where 



^(m noV 

\ V ' V / 



n(^,^)= t x-^-\l-xY-^-' dx. 

\ V V ) Jo 



(18) 

(19) 
(20) 

(21) 

(22) 
(23) 

(24) 
(25) 



This makes P{x) a beta probability distribution (Johnson et al., 1995) with parameters ^ 
and The first two moments of P{x) are thus 

Toi 



1^ 



and 



(7^ = 



ToiTio 



(roi + noY + ^ + 1) 



(26) 



(27) 



The mean, /i, and the variance, a^, provide two more statistics with which to form couplets. 
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2.1.3 Cumulative Power 



A final statistic we examine is cumulative power. For a twice differentiable signal m, and 
finite t, cumulative power, F, is defined as 

F{t) = [\m"{s)Y ds. (28) 
Jo 

Cumulative power has been shown to be 0{t) for any m that is a finite sum of sine and 
cosine functions. Moreover, the average derivative of F for such an m is pivotal with respect 
to the frequency and the amplitude of m (Quinn, 2011). Thus, measurements of the rate of 
growth in F (with respect to time) can provide statistics amenable to method of moments 

type estimators. 

In the donkey on a line system, we take m to be the donkey's position, x. Although 
X is not a periodic function, we derive a similar condition to the above for its cumulative 
power. First, however, we must adjust the definition (28) to account for the times ti, . . . ,tn 
at which x is not twice differentiable. We note that ti, . . . are the times at which the 
donkey changes state. From here on, we redefine the cumulative power of x as 

F{t)= f\x"{s)Y ds+ r {x"{s)Y ds + ---+ r {x"{s)yds+ f\x"{s)Y ds, (29) 

Jo Jti Jt„-1 Jtn 

where x" — In Appendix D, F is shown to be O (t). Anticipating, then, that F{t) can 
be approximated by a line, we look to obtain another statistic for use in the method of 
moments approach: the expected value of in terms of tqi and tiq. In carrying out the 
calculation, we employ the gamma function F : 1R+ M. defined^ by 

poo 

r (a) = / e-H^-^dt. (30) 
Jo 

Two useful properties of the gamma function (Abramowitz and Stegun, 1964) are 

T{a)T{b) =2{a,b)T{a + b) (31) 
F (a + 1) = aF (a) . (32) 

From (32), we easily obtain the identity 

F (a + 2) = (a^ + a) F (a) . (33) 
Now beginning our derivation of the expected value of ^ , in state we have 

{x"f = [i-vx)'] ' = {v^xf = v^x^ (34) 

^M"*" denotes the set of positive real numbers. The gamma function can also be defined for complex 
numbers with positive real part (Abramowitz and Stegun, 1964). 



8 



and in state 1 we have 

(x"f = \{v{l - x))'] ' = {v''{x - 1))' = v\l - xf. 



(35) 



Let z be the donkey's current state. From (34) and (35), it is clear that ^ is conditional 
upon the donkey's current position and state. The expectation for therefore, is found by 
averaging ^ over all possible position and state combinations. The likelihood of observing 
any one of these combinations is conditional upon the parameters Tqi and Tiq. More formally, 



E 



' dF 






_~dt 


Toi, Tio 





/ [v'^x^ Pr (a;|2; = 0, tqi, tio) Pr (2; = 0|toi, tio) 
Jo 

+ v^{l - xfPr {x\z = l,Toi,Tio) Pr {z = l\roi,rio)]dx. 



(36) 



For ease of notation, we denote S — E [^ \ roi,rio] . Substituting known probabihties 
and probability densities, (36) becomes 



2 1 ^-1^1 \^ f ''"10 ^ J 

/ X ::r-/ " (1 — x) « dx 

Jo ^ ' n(^ + l,^) ^ ' W+rio) 



+ 



We can multiply by fractions equal to 1: 

lo n(^,^ + i)n(^ + 2,^ + i) 

n(^ + i,^ + 2)^^ 



dx 



(37) 



.110 
x) 



Toi 



+ 



n (^ + 1, ^) n (211 + 1, + 2) 

Factoring out certain constants in (38) we get 

"n(^ + 2,^ + i)/ no 



x) 



'+1 



Toi + Tio 
Toi 



dx 



Toi + Tio 



dx 



(38) 



-X 



TQI 



+1 



(1 — x) ^° dx 



+ 



^(^ + 1.^ + 2)/ Toi 



-a; " (1 — a;) " (ia; 



(39) 



Recognizing that the integrals in (39) are of densities taken over their support and are thus 
equal to 1, we are left with 



n(3ii 

V V 



9 no 

^' V 



+ 1) 



Tio 



^ + 2 



+ Vt-01 + tio; + V^-oi + no 



Toi 



(40) 
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Applying (31) to the first term of the right-hand side of (40) yields 



n(^ + 2,^ + l)/ no 



r(^ + 2)r(if + 1) 



- + ^+3) 



^fc^ + l) W + noJ r (m) r (na + 1) _^_l_^ V^oi + rio 



no 



r(^ + 2)r(^ + ^ + i) / no 



Now, by applying (33) to (42) we have 

^(^ + 2,^ + 1)/ no 



( IOl\ _)_ IQl 

\ V J V 



no 



^(^,^ + 1) \roi + noJ (m + m + 1)% (m + m + 1) Vroi + Tio 
A similar computation on the second term of the right-hand side of (40) yields 
n(^ + l,^ + 2)/ roi \_ m' + f 



^(^ + 1.^^) V^oi + Tio; (m + m + i)2+(m + m + i) Vtqi + Tio 
Now, substituting (43) and (44) into (40), we have 

(^)' + ^)^io+((^)' + ^)roi 



.(roi + Tio) + ^ + + ^ + 

Finally, simplifying (45) gives 



S = v' 



V \ V V / 



(roi + Tio) (^ + ^ + 1) (^ + ^ + 2) 



(toi + Tio)(toi + Tio + v) ' 

Therefore, the average slope of the cumulative power of x is 



(toi + no) (toi + no + v) ' 
giving us our final statistic for use in a couplet. 
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2.1.4 Couplet Inversion and Results 



Now we have derived closed-form expressions for mean, variance, frequency, and the average 
slope of cumulative power, in terms of tqi and tiq. The method of moments approach 
demands, however, that each of these statistics can be measured given the data. For 
mean and variance, this is clearly possible. Measuring frequency, though, requires accurate 
detection of state transitions. Finally, cumulative power is perhaps the most difficult to 
measure, as it requires knowledge of the donkey's state at all time steps. In the following 
simulations, we ignore these issues as to identify which couplets perform most effectively. 
That is, we assume we have perfect measurements of the four chosen statistics. 

To complete our method of moments approach, we find invertible couplets of our closed- 
form expressions. Examining the equation for each statistic ((6), (26), (27), and (48)), we 
find that each one is symmetric with respect to tqi and tio, with the exception of the mean, 
II. This implies that any invertible couplet will include the mean. For example, inverting 
the equations for mean, /i — t-q^^+Vio ' frequency, ou — we find 

Toi = (49a) 

TlO = -. (49b) 
A* 

Inverting the equations for mean and variance, — .1°/^t,]? rm — we obtain 

li'^v (1 - /x) 

Toi = 5 (50a) 

TlO = 5 ■ (oUbj 

Finally, inverting the equations for mean and average cumulative power slope, 
S = , ^ " v"^"!? we find 

H^v^ — iiv^ + b 

Each of these couplet inversions vary in accuracy and precision, depending on the true 
values of tqi and tio. For instance. Figure 2 shows how well the mean and frequency 
couplet (49) performs for different pairs of roi and tiq. The surfaces shown were generated 
by running twenty simulations of donkey on a line, each one for 20,000 time steps, for every 
pair of parameters. There were 361 total pairs, created by meshing grids of 19 equally 
spaced points from 0.01 to 0.1. In each diagram, the vertical axis identifies the average 
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absolute error seen in the estimates (49a) and (49b). We can see that estimates improve 
as Toi and tiq both approach zero. Similar behavior was observed for the couplets (50) and 
(51). 




(a) Error in (49a) (b) Error in (49b) 



Figure 2: Average absolute residuals in parameter estimations using the mean and frequency 
couplet (49) 

Next we compare the different couplets to one another. Figure 3 shows boxplots of 
the residuals for the estimators (49), (50), and (51). These boxplots were generated in 
the same manner as the results of Figure 2, except simulations were run with a varying 
number of total time steps. As the number of observed time steps increases, estimates 
become more accurate. This is due to the law of large numbers: As the donkey takes more 
steps, his observable statistics — mean, variance, frequency, and cumulative power, in our 
case — approach their theoretical values, improving the accuracy of the inverted couplets. 

Based on these boxplots, the mean and frequency couplet produces, on average, the 
best estimates of tqi and tiq. The median residual (shown as the middle, red line) appears 
to converge to zero quickly, and the interquartile range (the vertical distance between the 
two blue lines) is narrow even for a small number of time steps. These qualities indicate 
that the mean and frequency couplet is a more efficient and less biased estimator of Tqi and 
Tio than the other two couplets. 

2.2 Likelihood-Based Estimation 

Another method of parameter estimation takes a likelihood-based approach. This approach 
involves calculating a function, F{tqi,tiq), that measures the likelihood of the observed 
data matching the parameters 7^ and tiq. One such function is the log-likelihood function. 
Given a set of data, the log-likelihood function evaluates the beta probability density 
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Figure 3: Boxplots comparing the performance of various couplets. Red markers denote out- 
liers, defined as those values separated from the median by more than 1.5 times the interquartile 
range. 
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function 



P ix\^i, no) = Cx"^-' (1 - x)"^-' (52) 

at the observed data (the donkey's observed positions). The function then takes the log of 
each result and sums these new results. Ideally, when this process is performed with the true 
tqi and tiq, the majority of data is concentrated in areas of high density, so that F{tqi, tiq) 
is maximized at (Toi,rio). To make this maximum a minimum, we examine —F{tqi,tio) 
instead of F(roi,rio). The estimates for tqi and rio are taken to be the location of this 
minimum. 

For example. Figure 4 graphs the negative log-likelihood function given a simulation 
of donkey on a line with 10,000 time steps. The input parameters were tqi = 0.05 and 
Tio = 0.08, with V = 0.1. The minimum of the negative log-likelihood function, shown in 
Figure 4 as a blue dot, is located at = 0.045 and f^o = 0.085. Thus, estimation errors 
are quite low. And as with couplet inversion, this likelihood-based technique benefits from 
greater observation time. 




2.3 State Detection 

While method of moments type estimators and likelihood-based estimators provide reason- 
able estimates for tqi and rio, we are, in fact, able to calculate these parameters directly 
from our data. By comparing two consecutive positions of the donkey, we have a vector 
that points in the direction the donkey is moving. If the donkey's position is increasing, 
we know he is in state 1. By contrast, if the donkey's position is decreasing, we know he is 
in state 0. By dividing the number of time steps that the donkey changes from state to 
state 1 by the number of time steps the donkey spends in state 0, we have our result for 
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Toi- A similar computation yields Tiq. This state detection technique provides the actual, 

observed values of tqi and Tiq. Because of the stochastic nature of the system, we do not 
expect these measured values to exactly match our input values. Nevertheless, by the law 
of large numbers, a greater number of observed time steps should lend measured values 
closer to the input values. Table 1 affirms this expectation. 



Table 1: Comparison of input and observed values for Tij 







Observed (xlO"'^) 




Input (xlO-^) 


N = 10000 


N = 100000 


TOI 


5.00 


4.73 


4.90 


TlO 


8.00 


6.83 


8.09 



3 Donkey in a Triangle 

We now consider a system in which the donkey has more than two states by extending 
the classic, one-dimensional problem to two dimensions. We let the donkey roam in a 
triangular pen with vertices located at the coordinates (0,0), (1,0), and (0,1). Let state 
be the state in which the donkey is moving toward (0, 0), state 1 be the state in which 
the donkey is moving toward (1, 0), and state 2 be the state in which the donkey is moving 
toward (0, 1).^ The donkey's continuous motion is modeled by the following equations: 

State 0: ^ ^ v{-x) ^ = vi'V) (53) 

State 1: ^ ^ v{l - x) f = ^(-y) (54) 

State 2: ^ = v(-x) ^^v(l-y). (55) 

Keeping the notation of the one-dimensional case, the donkey switches from state i to 
state j at a given time step with probability Tij. We now have a Markov process with six 
parameters to estimate instead of only two. 

3.1 The Markov Process 

Recall that in the one-dimensional case we examined a Markov chain to determine the 
expected proportion of time the donkey spends in each state. In the two-dimensional case 
we analyze an analogous process: 

Pn = Apn-1, (56) 



''For simplicity, the donkey is always initially placed at (|, |) and is taken to be in state 0. 
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where 



T01-T02 Tio T20 

Toi 1 — Tio — T12 r2l 

T02 T12 1 — T20 — T21 



(57) 



is still a column stochastic matrix. As before, A has eigenvalue 1 and associated eigenvector 
(see Appendix B) 



V = 



T21T10+T12T20+T10T20 



T2lTl0+Tl2T20+Tl0T20+T20T01+T02''"21+T0lT21+Tl0T02+T0lTl2+T02Tl2 

T20Toi+ro2T2i+roir2i 

T21Tlo+ri2T20+T10T20+''"20T01+ro2T21+'roir21+rio'ro2+roiri2+T02n2 

T" 10''"02+T'01T12+T02T12 

'-T"2iTi()+ri2r2o+TioT2o+T2oroi+ro2T2i+roir2i+Tioro2+Toiri2+T()2Ti2-l 



(58) 



In the one-dimensional case, wc were able to use the eigenvector associated with the Markov 
matrix to determine the frequency of the donkey switching between states. By adding just 
one more state to our system, it becomes difficult to define what is meant by the term 
"frequency." We now have several types of transitions, so using frequency as an intuitive, 
feature statistic is less appealing. Furthermore, as we continue to expand the system, 
finding a closed form for the eigenvector becomes computationally unrealistic; with just six 
states, the number of terms in the numerator of an eigenvector coordinate exceeds 1000 
(see Appendix C). 



3.2 Continuous Dynamics 

Hoping to derive statistics such as mean and variance in the two-dimensional case, we are 
interested in finding the probability distribution of our system. We let P{x, y, t) be the 
probability density of the donkey's position at time t. Again, we consider a decomposition of 
P into the state-dependent, conditional probabilities Po{x,y,t), Pi{x,y,t), and P2{x,y,t): 

P{x,y,t) ^ Po{x,y,t) + Pi{x,y,t) + P2{x,y,t). (59) 

We have the following conservation conditions, which are direct extensions of (8) and (9): 

dP d d 

dP d d 

-^ = -— [v{l - x)Pi] - — H-y)Pi] - rioPi - ri2Pi + tqiPo + r2iP2 (61) 
dP d d 

~d^~di ["'^"^^^2] - ^ [v{l - y)P2] - T20P2 - T21P2 + T02P0 + T12P1. (62) 
As in the one-dimensional case, we look for a steady-state solution, when 

dPo dP, dP2 „ 



16 



That is, the probabihty distribution of the location of the donkey does not change with 
time, and we may drop the dependence of Pq, Pi, and P2 on t. Adding (60), (61), and (62) 
in this case, we have 

= [v{-x)Po] - ^ [v{l - x)P,] - ^ [v{-x)P,] 

f) f) f) (^^) 

- H-y)P.\ - H-y)Pi] - ^ b(i - y)P2\ , 

where we have collected the x and y components. Assuming total fluxes are balanced in 
each direction, (64) implies 

= v{x)Po-v{l-x)Pi+v{x)P2 (65) 
= v{y)Po + v{y)Pi - v{l - y)P2. (66) 



Solving for Pq in (65) yields 



Po^-Pi + — - P2. (67) 

X 



Substituting (67) into (66) gives us 



= v{y) ( -Pi + — - P2 ) + v{y)Pi - v{l - y)P2 (68) 



X 



y 

Using (69) in (67), we see 



Pi = — . (69) 



Po = -i^ + ^-P2. (70) 

y y 



We then apply (69) and (70) to (62), the left-hand side of which we have assumed to be 0: 

= H-X)P2] - jr- [vil - y)P2] - T20P2 - T21P2 

OX ay 

[ XP2 P2 „\ XP2 ^ ' 

+ ro2 + P2 +ri2- 



y y J y 

Applying the derivative operators using the product rule and rearranging terms, we arrive 
at the following differential equation: 

dP2 .dP2 P2 f ^ ^ T02 ^ (ri2 - ro2)x \ 

-x^- + (1 - y)^r- = — -T20 - T21 + T02 H \ h 2^; . (72) 

ox dy V \ y y ) 
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Although we identify (72) as a first-order, semihnear partial differential equation 

(Polyanin and Zaitsev, 2004), wc arc currently not able to solve it for P2. In fact, we 
arc not sure this is a well-posed problem. Unfortunately, this means wc cannot apply the 
same methods as before to find expressions for moments such as mean and variance. More- 
over, we have no probability distribution with which to perform the same likelihood-based 
scheme as described in 2.2. 

3.3 State Detection 

As some of the methods applied to the donkey on a line system become intractable in a 
multi-dimensional extension, we turn to state detection as in 2.3. Before applying a method 
of state detection to the donkey in a triangular pen, however, we check that his movements 
are always linear. Consider the case when the donkey is in state 0. As stated earlier, his 
movement in the x direction is given hy ^ = v {—x), and his movement in the y direction 
is given hy ^ = v {—y). Dividing these, we find 

^ dx V {—x) X 

Since the slope given by (73) is exactly in the direction of (0, 0), we see that the donkey's 
movement is linear. A similar computation can be performed for states 1 and 2 to show his 
movement is linear in every state. From this fact, we build a state detector for the donkey 
using the following algorithm: 

1. Compare two consecutive positions of the donkey to make a vector pointing in the 
direction of the donkey's current movement; call it a. 

2. Construct vectors pointing from the donkey's current position to each of the coordi- 
nates of the triangle; call them bj for i = 0, 1, 2. 

3. Compare a to each bj by calculating the cosine of the angle between them via the 
formula 

cos^^^p^. (74) 

The hi that yields cos^^ = 1 is the vector that is parallel to the donkey's movement. 
Thus, the donkey is in state i. 

4. Repeat this process for all observed data. 

Now that we have a state detector, we can use it to find each rij empirically. For 
example, if we want to estimate rio, we first use the state detector to find the number of 
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time steps that the donkey spends in state 1. Next, find each time step that the donkey 
changes from state 1 to state 0. By dividing these results, we have our estimate for Tiq. 
This process is easily seen to generalize to any convex geometry in which the donkey can 
roam. 

Because of the stochastic nature of our system, we do not expect our observed Tjj's to 
match perfectly with our input Ty's. However, by the law of large numbers, we expect 
that they will converge as the number of time steps increases. Table 2 shows the results 
from a simulation with v = 0.01 and varied total observed time steps N. As expected, a 
greater number of observed time steps yields, on average, observed parameter values closer 
to input parameter values. 



Table 2: Comparison of input and observed values for Tjj 







Observed (xlO ^) 




Input (xlO-^) 


N = 10000 


N = 100000 




1.00 


0.76 


0.81 




6.00 


5.83 


6.56 


no 


2.00 


3.52 


1.71 




3.00 


2.24 


3.15 


T"20 


4.00 


5.13 


3.95 


T21 


5.00 


5.13 


5.02 



4 Noisy Data 

The state detector solves our two-dimensional extension of Buridan's ass, as it allows us 
to directly measure the parameters of the system. Although these measured values may 
not be equal to the input parameters, they are the values we wish to obtain as we continue 
our exploration of donkey in a triangle. They are the observed, as opposed to theoretical, 
probabilities of transitioning, making them the target estimates for further simulations. 

A system in which we have perfect measurements of the donkey's trajectory is not 
realistic. Real- world observations of dynamical systems involve some level of error, or 
noise. By adding noise to our system, we obtain data that is comparable to real-world 
measurements of dynamical systems. We now shift our focus to managing noise in our 
data, with the goal of obtaining reasonable estimates for the x^'s. 

To compare fairly the performance of the techniques discussed, we apply each to the 
same data set. This data was taken from the simulation used to construct Table 2. Prom 
here on we will assume we observe the donkey for N — 10000 time steps, so that the target 
estimates for the various Tij are those shown in the middle column of Table 2. Independent 
and identically distributed noise was added to each coordinate of the donkey's trajectory. 
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at each time step. This noise was taken from a normal distribution with a mean of and 
a standard derivation of 0.01. 

4.1 Linear Regression 

Linear regression is a statistical technique that aids in the study of linear relationships 
between variables. By modeling a set of data by a line, linear regression has potential to 
remove artifacts of noise. Outlined below is our application of linear regression to state 
detection: 

1. Consider the donkey's observed positions in the x and y directions separately. We 
let X and y be vectors containing the positions of the donkey in these directions, so 
that {xj,yj) is the donkey's observed location at the jth time step. We let t be the 
vector containing the times at which observations are made. In general, we assume 

2. For each j = l,...,A^ + l, where N is the number of observed time steps, we calculate 
the hues of best fit to the data sets {tk,Xk)i'^ and itk,yk)k^ , where W is the so- 
called viewing window. That is, W is the number of time steps from the current 
position considered in linear regression. (If j + W exceeds the length of x or y, then 
W is taken to be the largest possible viewing window.) 

3. We let fli be the slope of the best-fit line in the x direction, and 02 be the slope of 
the best-fit line in the y direction. We set a = [ai, 02]^. 

4. Next construct vectors pointing from {x,y), where x — ^^Ylk^ ^k, and y — 

W^J^it^Vkj to each of the coordinates of the triangle; call these vectors bj for 
i = 0,1,2. 

5. Proceed as in 3.3, now accounting for noise by taking the i that yields the greatest 
value in (74). 

Parameter estimates obtained using this algorithm are displayed in Table 3. Although 
increasing the viewing window W leads to more reasonable estimates, in no case does linear 
regression provide reliable information. Each estimate's absolute relative error is also shown 
in blue in Table 3, where the error is computed based on the observed Tij values of Table 2. 
Clearly the noise present in the system must be dealt with in a different manner. 

4.2 Denoising Techniques 

An alternative approach to linear regression is to denoise the observed data before applying 
state detection. That is, through a chosen method, we transform our observed, noisy 
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Table 3: Estimated values (xlO ^) for Ttj and relative error after adding noise, using linear 
regression with different viewing windows 





= 1 


W = 3 


W = 5 




415.67 (54471.04%) 


248.34 (32581.83%) 


160.30 (20995.21%) 




416.29 (7045.70%) 


239.39 (4009.18%) 


131.80 (2162.38%) 


Tin 


372.51 (10492.96%) 


214.55 (6001.10%) 


138.62 (3841.92%) 


Tio 


412.57 (18336.12%) 


193.85 (8562.11%) 


100.55 (4392.25%) 


T20 


372.17 (7154.76%) 


206.50 (3925.43%) 


122.19 (2281.89%) 


T-21 


402.30 (7742.08%) 


191.39 (3630.75%) 


94.52 (1742.60%) 




W = 10 


W = 15 


W = 20 


^01 


54.47 (7068.42%) 


28.32 (3626.27%) 


22.17 (2817.15%) 




53.31 (815.01%) 


29.16 (400.47%) 


22.17 (280.50%) 


no 


55.13 (1467.63%) 


36.78 (945.85%) 


29.14 (728.60%) 


T12 


40.34 (1702.84%) 


25.68 (1047.60%) 


17.93 (701.28%) 


T20 


57.25 (1015.92%) 


26.85 (423.31%) 


20.81 (305.74%) 


T21 


37.06 (622.41%) 


29.29 (470.88%) 


20.51 (299.77%) 



data into a data set which should represent the actual trajectory of the donkey. We now 
explore various methods of denoising and compare the accuracy of the resulting parameter 
estimates. Each method is presented as applied to a one- dimensional data set. That is, 
we denoise the data vectors x and y (each one containing N + 1 coordinates) separately 
and then apply the state detector presented in 3.3 to this denoised data, thereby producing 
estimates of the various Tij. 

4.2.1 Locally Weighted Polynomial Regression 

The first denoising technique we explore is locally weighted polynomial regression (LWPR). 
LWPR is a method of fitting a curve to a data set using low-degree polynomials (Fan 
and Gijbels, 1996). To utilize the versatility of polynomials while avoiding the Runge 
phenomenon that plagues high-degree interpolation (Runge, 1901), LWPR fits polynomials 
to subsets of the data rather than fitting a single polynomial to the entire set. A smoothing 
process is then applied to join the local polynomials into a single curve. This process is 
governed by a smoothing parameter, < h < 1, which is the percentage of the total data 
that is considered in smoothing about a given data point (Cleveland, 1979). 

In the same simulation as before, LWPR was performed using quadratic polynomials 
and h = 0.005. Results are shown in Table 4. Despite being far more reasonable than 
the estimates provided by linear regression, those provided by LWPR are still poor. The 
estimates are nearly correct in the ordering of the parameters (i.e. which are the smallest, 
and which are the largest), but they are, in every case, overestimates. Such distortion of 
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the parameters is caused by over-detection of state transitions. The performance of LWPR 
may be improved by utihzing polynomials of a different degree and adjusting the parameter 
h. 

Table 4: Estimated values for Tij and relative error after adding noise, using locally weighted 
polynomial regression 





Input (xlO^^) 


Observed (xlO 


Estimate (xlO ^) 




1.00 


0.76 


3.76 (395.40%) 


T02 


6.00 


5.83 


9.41 (61.54%) 


TlO 


2.00 


3.52 


8.48 (141.06%) 


Tl2 


3.00 


2.24 


4.89 (118.55%) 


T20 


4.00 


5.13 


7.16 (39.50%) 


T2I 


5.00 


5.13 


8.40 (63.76%) 



4.2.2 Wavelet Filtering 

Wavelets are functions that together form an orthonormal basis of an infinite-dimensional 
vector space^ consisting of other functions. There are many different wavelet families, 
each suitable to represent a unique class of functions. While providing the convenience 
of orthonormal bases, wavelets have the added utility of decomposing functions by both 
frequency and scale (Mallat, 2008). That is, frequency can be analyzed over subsets of a 
function's domain, in contrast to decomposition into a Fourier basis, where a function's 
entire domain is considered in each basis coefficient. 

Wavelets can be used to filter noisy signals by decomposing them into their orthonor- 
mal basis representation and adjusting coefficients polluted by noise. In our case, we use 
the symlets 8 wavelet basis (up to level 6) with multi-level, soft thresholding. Table 5 
shows the parameter estimations resulting from this denoising. The overestimates suggest 
that wavelet filtering still allows over-detection of state transitions. There are many other 
wavelet families, however, and a variety of thresholding techniques. Wavelet filtering can- 
not be discarded entirely before performing a thorough exploration of the many possible 
combinations. 

4.2.3 Butterworth Filtering 

The Butterworth filter is a signal processing tool used to denoise a system by permitting 
and dampening certain specified frequencies present in the data (Butterworth, 1930). There 
are three types of Butterworth filters: lowpass, highpass, and bandpass. A lowpass filter 
permits frequencies below a specified cutoff frequency Uc but gradually dampens those 

^Oftcn this vector space is L^(M), the space of square-integrable functions on the real hne. 
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Table 5: Estimated values for r^j and relative error after adding noise, using wavelet filtering 





Input (xlO^^) 


Observed (xlO ^) 


Estimate (xlO ^) 




1.00 


0.76 


22.52 (2863.47%) 




6.00 


5.83 


49.21(744.66%) 


no 


2.00 


3.52 


28.40 (707.54%) 


m 


3.00 


2.24 


21.06 (841.05%) 


T20 


4.00 


5.13 


51.39 (901.80%) 


1"21 


5.00 


5.13 


22.64 (341.27%) 



exceeding ooc- In contrast, a highpass filter permits frequencies above the cutoff but dampens 
those below it. A bandpass filter has both lowpass and highpass cutoff frequencies. 

The type of filter we apply depends on which frequencies are most prevalent in the 
observed data. Figure 5 shows that, at least in our sample simulation, such frequencies are 
low in the spectrum. The x coordinate of the donkey's trajectory is shown in Figure 5a, 
and the discrete Fourier transform F of x, defined by 

N+l 

E -2-!r^/^uj(k-l) 
XkC , (75) 

fc=i 

is shown in Figure 5b. As the majority of energy is contained in the lower frequencies, a 
lowpass Butterworth filter is suitable for our problem. Our goal is to eliminate noise that 
may dominate high frequencies while relying on the high energy in the low frequencies to 
suppress the effect of noise there. 



A.. 

2000 4000 6000 8000 10000 -SbO 500 

t ^ 

(a) Noiseless x against time (b) Discrete Fourier transform of x 

Figure 5: Identification of predominant frequencies in the donkey's trajectory 

To denoise the donkey's trajectory, a fifth order lowpass Butterworth filter was used, 
with ujc = 100. The resulting parameter estimates are given in Table 6. These results are 
easily the best seen thus far. 
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Table 6: Estimated values for r^j and relative error after adding noise, using butterworth 
filtering 





Input (xlQ-^) 


Observed (xlO ^) 


Estimate (xlO ^) 




1.00 


0.76 


2.68 (253.19%) 


T02 


6.00 


5.83 


4.29 (26.29%) 


no 


2.00 


3.52 


4.14 (17.77%) 


n2 


3.00 


2.24 


2.55 (13.89%) 


T20 


4.00 


5.13 


4.57 (12.95%) 


1"21 


5.00 


5.13 


3.51 (31.60%) 



4.2.4 Total Variation Denoising 



The last denoising approach we considered is a technique commonly used in image pro- 
cessing: total variation (TV) (Rudin et al., 1992). Unlike the Butterworth filter, TV is not 
spectral-based; that is, it does not involve decomposition into a Fourier basis. Rather, to 
denoise a vector x, TV seeks to solve the optimization problem 



argmm — x 

% 2" 



+ l|Vx||i, 



(76) 



where 



0- 
-110- 
0-11- 

0- 







-1 1 



(77) 



is a discrete derivative operator.^ The goal is to capture the overall behavior of x with 
an estimate x, while removing small, spurious oscillations that are merely artifacts of 
noise. Hence, we minimize the sum of a fitting term and a penalty term that discretely 
approximates the integral of the absolute value of the derivative (this integral is called 
a function's total variation). Our choice of 7 should refiect the relative importance of 
minimizing each term. With greater noise, oscillations become more pronounced, so 7 
would be placed at a lower value to assign more weight to minimizing total variation. 

Difficulty lies is solving (76), due to the non-differentiability of the £l-norm.^ To resolve 
this issue, we refer to Goldstein and Osher (2009), in which the authors rephrase the 



is shown with entries of ±1 since we assume measurements are taken a unit of time apart. The 
matrix, however, can be generahzed for nonunit and nonuniform spacing, each row with entries of ±Af for 
a unique At. 

^The ^1-norm is defined as 



x||i = X)fel^*:|- The i'2-norm is defined as ||x||2 = 



fcFfel 
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unconstrained minimization into a constrained one, in the process splitting the £1 and £2 
portions of the problem. In this new formulation, the authors solve the minimization using 
so-called Split Bregman iteration, which requires the convexity of the £1- and £2-norms. 
We demonstrate below this process as applied to (76). 

We first reformulate (76) as an equivalent unconstrained problem: 

argmin -||x — x||2 + ||d||i such that d = Vic. (78) 

The constraint in (78), though, is weakly enforced, introducing a second fitting term with 
weight parameter A: 

argmin ^||x — x||2 + — ||d — Vx||2 + ||d||i. (79) 

The optimization as presented in (79) is now amenable to Split Bregman iteration, which 
initializes do = bo = and follows a two step process for i = 1, . . . , n: 

/y A 

Step 1: (xi+i, dj+i) = argmin -||x — x||2 + -||d — Vx — bi||2 + ||d||i (80a) 

x,d 2 2 

Step 2: hi+i = bi + Vxi+i - d^+i. (80b) 
For step 1, the algorithm solves for x and d separately. First, 

Xi+i = argmin -||x — x||2 H — ||d — Vx — h^Wl, (81) 



which can be solved by differentiating with respect to x, and setting the result equal to 0, 
as the right-hand side is convex. That is, the minimizer x is such that 

7 (x - x)^ - A (d, - Vx - b,)^ V = 0^, (82) 

or equivalently 

(7/ + V^V) X = AV^ {di - hi) + 7x. (83) 
We now attain the solution for (81) by solving for x : 

X, = (7/ + V^ V) ' (A V^' (d, - b,) + 7x) . (84) 
Next, the algorithm produces dj+i coordinatewise by 

di+ij = shrink ((VCi+i) J + bij, 1/A) , (85) 



25 



where 



shrink(x, S) = — ■ max(|x| —6,0). (86) 
\x\ 

Once the n iterations of (80) are carried out, we have our denoised x in the form of (84). 

The above algorithm was apphed to our system with 7 = 0.5, A = 20, and n = 10. 
The parameter estimates obtained are displayed in Table 7. These results are comparable 
to those resulting from Butterworth filtering. The parameter with the lowest input and 
lowest observed value, tqi, is still estimated with the greatest error. This error suggests 
parameters with smaller values come with greater uncertainty. To explain this, we note 
that a smaller parameter value implies the associated transition is observed less frequently, 
perhaps making estimation schemes more sensitive to noise. 

Table 7: Estimated values for r^j and relative error after adding noise, using total variation 





Input (xlO^^) 


Observed (xlO "*) 


Estimate (xlO ^) 




1.00 


0.76 


3.45 (374.69%) 


T02 


6.00 


5.83 


4.44 (23.79%) 


TlO 


2.00 


3.52 


4.44 (26.30%) 


T12 


3.00 


2.24 


2.86 (27.59%) 


T20 


4.00 


5.13 


4.62 (9.86%) 


T2I 


5.00 


5.13 


3.08 (39.91%) 



As can be seen in Table 7, the total variation approach tends to create parameter esti- 
mates exhibiting less variance than the actual, observed values. That is, parameters with 
high values are often underestimated, and parameters with low values are often overesti- 
mated. This may be a result of a marked shortcoming of TV: the ability to preserve sharp 
changes in the data. While TV eliminates the spurious oscillations created by noise, it 
follows state transitions too early. Figure 6d illustrates this feature. 

Figure 6 also provides a graphical explanation for the deficiency of LWPR and wavelet 
filtering. The artificial oscillations created by noise are not fully removed (seen in Figure 6a 
and Figure 6b), leading to over-detection of transitions. Butterworth filtering (Figure 6c) 
both removes the artifacts of noise and captures the rapid state transition, even if slightly 
late. 

5 Conclusion 

When Buridan's ass is modeled as a dynamical system with a discrete-time, discrete-state 
Markov process, there are several effective techniques of parameter estimation. These tech- 
niques include method of moments type estimators, likelihood-based estimators, and state 
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(a) Locally Weighted Polynomial Regression 
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(b) Wavelet Filtering 
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(c) Butterworth Filtering (d) Total Variation 

Figure 6: Graphical comparison of denoising techniques applied to the data vector x 



5000 



detection. In higher dimensional extensions of this problem, however, the method of mo- 
ments approach becomes unmanageable. For n states, there are n{n — 1) parameters to 
estimate, and a method of moments approach requires an equal number of invertible, inde- 
pendent statistics. Moreover, deriving expressions for moments such as mean and variance 
is made difficult by potentially ill-posed partial differential equations. These problematic 
PDEs also eliminate the possibility of performing simple, likelihood-based estimation. 

The state detector is the one method that generalizes to higher dimensions. Further- 
more, state detection is possible in any convex geometry. Therefore, we focus on this 
method to estimate our parameters in the three-state problem. State detection provides us 
with the best results we hope to achieve from observing an ideal, noiseless system. Unfor- 
tunately, an ideal donkey is not a realistic donkey. Adding noise to the system provides a 
more realistic problem. State detection alone, however, performs poorly in estimating the 
parameters, as transitions are over-detected in the presence of noise. Combining our state 
detector with denoising techniques relieves much of this over-detection. Of the denoising 
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techniques presented, Butterworth filtering and TV lead to the best parameter estimates. 

Further research lies in refining the dcnoising techniques shown, as well as exploring 
other denoising methods, to produce better estimates of the parameters. Any of the estima- 
tion schemes presented could also be used to inform Markov chain Monte Carlo simulations. 
Provided with good initial estimates, such simulations can improve convergence by search- 
ing over narrower parameter regimes. 

Appendices 

A An Alternative Perspective 

Instead of a discrete-time Markov process, we can use a continuous-time Poisson process to 
model the stochastic switching nature of our dynamical systems. This formulation involves 
parameters dictating the average time observed between transitions, rather than parameters 
dictating the probability of transitioning in each unit of time. Nevertheless, the task of 
estimating these new parameters presents the same challenges associated with measurement 
error as before. An implementation of this model would again assume knowledge of the 
donkey's position at finitely many time steps. 

In this alternative perspective, we assume that transitions (e.g. state to state 1) each 
describe a Poisson process, so that the elapsed times between transitions follow exponential 
distributions. For instance, in the case of donkey on a line, a transition to state is followed 
by some random amount of time dtd (before a transition to state 1) sampled from an 
exponential distribution with mean parameter //qi. That is, the random variable dtoi has 
density 



Waiting times for transitions from state 1 to state are also governed by an exponential 
distribution, just with a different mean parameter fiiQ. 

Considering donkey in a triangle, there are two possible transitions from each state. 
Thus, the dynamical system contains six mean parameters to be estimated. Now, transi- 
tions away from a state are viewed as two competing Poisson processes. We can consider 
two dfs for the two possible transitions, and the transition with the smaller dt will occur. 

Generalizing this concept and (87) to any number of states, we say that waiting times 
for transitions from state i to state j are taken from the exponential distribution 




(87) 
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We will assume < /lij < oo for all i and j. Prom (88) we obtain the cumulative distribution 
function 



Fijit) = [ his) ds (89) 

J —oo 

t 



= / Us) ds (90) 





/' 

Jo f^ij 

t 



1 -^s 

e ds (91) 



s 

— _g Mij 







(92) 



= 1-e (93) 

With probability Fij(t) (assuming the donkey's only possible transition is to state j). the 
donkey spends less than t units of time in state i before making a transition to state j; 
that is, dtij < t. Rather, dtij > t with probability 

1 - Fij{t) = . (94) 
This observation leads to the following property of the Poisson process: Por any s,t > 0, 
Pr(dt„- > t + s\dUj >t) = ^'^(dUj >t + s) 

= — (96) 
e '''3 



= e ''i^ * (97) 
= PT{dtij > s). (98) 

Thus, the likelihood of a transition occurring in a given length of time is independent of 
when the previous transition occurred. This property is the continuous analog of the time 
independence that characterizes the steps of a discrete Markov process. 



A.l Expected Waiting Times 

As stated before, when more than one type of transition is possible (i.e. the system has 
more than two states), we must consider multiple dfs, each one an independent, random 
sample from a separate exponential distribution. But suppose we wanted to know how 
long, on average, the donkey spends in a given state. In other words, we wish to determine 
the expected waiting time for any transition to occur. 
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Preserving generality, let there be n states, and let the donkey be in state i. We are 
interested in how long we wait before observing the next transition: the time elapsed (since 
the transition to state i) will be denoted dti. This quantity is given by 

dti = min(dtio, ■ ■ ■ , dti^i-i, dti^i+i, . . . , dU^n-i)- (99) 

The probability of having had a transition by time t is 

VY{dti < t) = 1 - VY{dti > t) (100) 
= 1 - Pr(diio >t)--- PT{dti,i-i > t) PT{dti,i+i >t)--- PT{dti,n-i > t) (101) 

= 1 - ^e~^^*^ ■ ■ ■ *^ *j • • • *^ (102) 



__i 1 ± 1 ± I I ± t 



We recognize this as a cumulative distribution function 

Fi{t) = l-e~^'^'^^\ (104) 

Thus, dti has density 

fi{t) = jFi (105) 
= J]— e~(^^'^'^)*. (106) 

This is an exponential distribution with mean parameter 



That is, the average waiting time to observe a transition away from state i is Hi units of 
time. Assuming the donkey's state information is detectable, we can empirically observe 
fii and use such a measurement to estimate the various iiij. 



A. 2 Transition Probabilities 

We next ask what the probability is of observing a particular transition: If the donkey is 
in state i, what is the probability pij that his next transition is to state jl To calculate this 
probability, we need only to realize that doing so is equivalent to calculating the probability 
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e'V^'^^'l^kr dt 

l^ij Jo 



That is, we expect 



that dijj > dtik for all A; 7^ Since Pr((iijj = dij^ for some j ^ k) — 0, we do not worry 
about simultaneous transitions. We find 

Pij — Pi{dtij > dtik for all k = / [Pr{dtij = t) Pr((itio > t) ■ ■ ■ 

Jo 

Pr(dii,i_i > t) PT{dti,i+i >t)--- 

(108) 

PT{dtij_i > t) Pr{dtij+i >t)--- 

Pr{dti,n-i > t)] dt. 
Using (94), the integral in (108) can be evaluated to give 

Jo LU ^ y A ; ^^^^^ 

(110) 

(111) 

(112) 
(113) 



of the transitions from state i to be to state j. Again, assuming the donkey's state informa- 
tion is detectable, wc can measure pij (very similar to Tij in the Markov chain formulation) 
and use it to estimate fj^ij. 

A. 3 Pcirameter Estimation 

We have nearly derived an expression to estimate the various /lij parameters. We now 
simply solve (113) for /lij to yield 

fJ'i 



Pij 



(114) 



Recall that /Xj and pij are both easily measured given state information, so the empirical 
/lij can be calculated. 
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B Eigenvector of Markov Matrix with Eigenvalue 1 

As presented in 2.1.1 and 3.1, the normalized eigenvector associated with eigenvalue 1 of 
the 2 X 2 

1-Toi no 

Toi 1-Tio 



and the 3x3 



Toi 
To2 



TlO 
TlO - ^12 
Tl2 1 



T20 
T21 
^20 - T21 



Markov matrices are known. In fact, they can be calculated using standard techniques. 

With increasing size of the Markov matrix, however, these techniques fail by theoretical 
necessity: Determining the eigenvalues of an arbitrary n x n matrix requires solving an 
nth degree polynomial, for which a general solution cannot exist for n > 5 due to the 
Abel-Ruffini theorem (Rotman, 2005). 

We are only interested in determining the eigenvector associated with eigenvalue 1. 
Here wc derive a formula to calculate this eigenvector given an n x n Markov matrix. The 
following theorem and proof are taken from Lavallee (2011). The result is shown for a 
general column stochastic matrix. 

Theorem 1. If A is annx n column stochastic matrix, then A has an eigenvalue of 1 and 
associated eigenvector v given by 

(115) 



where M = 
kth column. 



Vk = det(Mfc,fc), 

A — I and M^^^ is the matrix formed from M by removing its kth row and its 



Proof. We wish to show 



= V 

(A - 7) V = 
Mv = 0, 



(116) 
(117) 
(118) 



or equivalently 



rrii^j det Mj- j = 



(119) 
(120) 
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for i = 1, . . . , n, where rriij is the entry of M in the ith row and the jth column. We will 

show (120) only for i = 1, as the remaining n — 1 cases are analogous. 

First notice that the rows of M are linearly dependent (each column sums to 0, so any 
row is the negative of the sum of all other rows), implying det M = 0. We also have 



m2,2 ■ ■ ■ rn2,, 



mi 1 



= mi 1 



^2,2 



^^n,2 ' ' ' '^n.n 



mi,2 



H \- mi^n 

mn,n rnn-1,1 
7713,3 



m3,i 



m„,i 



H h mi; 



En \--\n 
i=2^j,l -E,=2"^i,: 



^3 

m2A 



rrin-iA 



/,2 



m2,2 



mn-19 



mn,3 

En 
j=2 ^j,n-l 

m2,n-l 
mn-l,n-l 



mi,„_i 
m„_i,n_i 

En 
j=2 "^j,n 

m3,„ 



(121) 



(122) 



where, in the last n — 1 terms of the sum, we have expressed the first row of the matrix 
as the negative of the sum of the last n — 1 rows (with the appropriate columns omitted) 
of M. In computing the determinants of these matrices, we now remove the negative from 
the first row and spht the sums: 



Yl ^^'j ^i'i ^ "^I'l 



m2,2 



m2,n 



m„ ,2 • • • m„ . 



mi,2 J] 

j=2 



mi,„ Y 

j=2 



m2,l 7712,2 

rrin-iA mn-1,2 



^3,1 m3,3 • 
mn,i m„,3 • 

^^j,n—l 
m2,n-l 

• • m„_i „_i 



m 



],n 



nir. 



(123) 



In each of the n — 1 sums on the right-hand side of (123), only one j yields a first row that 
is not identical to one of the other rows (in particular, for the sum following mi,k, this j 
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equals k). As all other j yield a determinant of 0, each sum reduces to a single term: 



^mijdetMjj^mi, 



"^2,2 • • • '^2,n 



rrir 



m2,l m2,3 ••• "7,2,n 
"^3,1 "^3,3 ••• m3,„ 



mn,\ mn,2 ■ ■ ■ rnn,n-i 

"7,2,1 "72,2 ••• m2,n-l 

"7n-l,l 1Tln-l,2 ' ' ' "7„_i^„_i 



(124) 



We now shift down the first row of the last n — 2 matrices in (124) so that rows are in order 
(for the mi^k term, this requires k — 2 transpositions, or "flips," of rows). Furthermore, 
each transposition changes the sign of the determinant, so we have 



"72,2 



"72,n 



"7n,2 • • • "7n,n 



+ (-l)V2 



"72,1 "72,3 
"73,1 "73,3 



"72,n 
"73,n 



+ --- + (-l)"-imi, 



7772,1 "72,2 



(-i)'+V,i 



"^n-1,1 "7„-l,2 
"7n,l "7n,2 

"72,2 ■ ■ ■ "72.^ 



"7n,l "7„,3 • • • mn,n 
"72,n-l 

• "7n-l,n-l 
"7n,n— 1 



"7n,2 



"7^ 



+ (-i)^+V 



"72,1 "72,3 ••■ "72,n 
"73,1 "73,3 ••• "73,n 



"7n,l "7n,3 



nir. 



+ --- + (-l)^+"mi,„ 



"72,1 "72,2 

"7„-l,l "7n-l,2 
"in,l "7n,2 



"72,n-l 

"7n-l,n-l 
"7n,n— 1 



= X(-l)^+^mi,,detMi,,.. 



(125) 



(126) 



(127) 
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We recognize (127) as Laplace's formula by cofactor expansion for the determinant (Treil, 
2009), so we have 



J]mijdetMi,,- = ^(-l)i+^mi,,detMi,. 

= detM 
= 0. 



(128) 

(129) 
(130) 

□ 



C A Conjecture Regarding Eigenvectors 

Theorem 1 provides a way of computing the eigenvector corresponding to the eigenvalue 1 
of the n X n Markov matrix 



A 



1 - Toi ro,„_i Tio 

Toi 1 — Tio — Ti2 — • • • — Ti „_i 



Tn-1,0 



Tb,n-1 



Tl,n-1 



1 — Tn-lfl 



Tn-l,n-2 



■ (131) 



Nevertheless, the computation (115) has increasing complexity with increasing n. For ex- 
ample, using MATLAB®. we were able to compute the desired eigenvector only for n < 6. 

For n > 6, the computation proved too complex to execute. To gain a sense of this com- 
plexity, we examined the growth of the number of summands in a single application of 
(115). 

For example, in the 2x2 case, A has the normalized eigenvector 

Tio 



Toi + Tio 

Toi 
.Toi + Tio. 



In the 3x3 case. 



T21T10 + T12T20 + Tior20 



T21T10 + T12T20 + T10T20 + T20T01 + T02T21 + T01T21 + T10T02 + Toiri2 + T02T12 

T20T01 + T02T21 + T01T21 

T21T10 + T12T20 + T10T20 + T20T01 + T02T21 + T01T21 + T10T02 + T01T12 + T02T12 

T10T02 + T01T12 + T02T12 

.T21T10 + T12T20 + T10T20 + T20T01 + T02T21 + T01T21 + T10T02 + T01T12 + T02T12. 
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With four states, the eigenvector (not normahzed) becomes 



' TlOT"20T"30 + T10T20T31 + TioT2ir30 + T10T20TS2 + T10T21T31 + T12T20TSO + T1QT21T32 + noT"23T"30 ' 
+T12T20T31 + ri3T2or3o + TwT22,Tzi + T12T20TZ2 + T13T21T30 + Ti2r23T30 + T13T20T32 + ri3T23T30 

T0lT20T2,0 + TmT20T3l + TqiT21T2,0 + T01T20T32 + 7"0l7"2lT"31 + To2T2lT2,0 + TqiT21T2,2 + ToiT2iT3Q 
+T02T21T31 + ro3T2or3i + roir23T3i + ro2T2ir32 + ro3T2iT3i + T02T23T31 + ro3T2iT32 + T03T23T31 

T02T10T3O + ToiTi2r30 + ro2rior3i + TqiTuT-h + ro2rior32 + To2Tl2T-iO + T01T12T32 + Ti)2Tl2T-il 
+T02T13T30 + T03''"10T32 + Toiri3r32 + ro2Ti2r32 + ro3Ti2T3i + T02T13T32 + T03T12T32 + T03T13T32 

''"03T10T2O + T01T13T20 + To3rior2i + roiri3r2l + ro2'7"iO'723 + To3Ti2r20 + T01T12T23 + T02T13T21 
.+T03no''"23 + T03Tl3''"20 + T01T13T23 + T02T12T23 + T03T13T21 + T02T13T23 + T03T12T23 + T03T13T23. 

It is easy to see from (115) that for a given n, each entry of the eigenvector v wiU contain 
the same number of summands before normahzing. Counting the number of monomials in 
each numerator after normahzing, we see a possible pattern, illustrated in Table 8. 

Table 8: Number of monomials in a coordinate of the eigenvector corresponding to eigenvalue 
1 of Markov matrix 



Number of States 


Terms in Numerators 




2 


1 


= 2° 


3 


3 


= 3^ 


4 


16 


= 42 


5 


125 


= 53 


6 


1296 


= 6^ 



We are lead to the following conjecture. 

Conjecture 1. If A is the nxn matrix given by (131), then each coordinate of the eigen- 
vector V given by (115), when expressed in terms of the various Tij, is the sum of n"'~^ 
monomials, each with coefficient 1. 

If this conjecture is true, a derivation may proceed as follows. We let TZ be the number of 
monomials in a coordinate of v, the number we conjecture is equal to n"^"^. Let M = A — I 
as in Appendix B. As prescribed by Theorem 1, to compute the kih. entry of v, we must 
calculate detM^^fe. To do so, we will use the Leibniz formula (Treil, 2009), presented here 
for a general matrix B: 

n 

det S = sgn a JJ hi^a{i), (132) 
where Sn is the symmetric group on n elements, and hij denotes an entry in B. 
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In applying this formula, for example, to 



— Toi — • • • — To,n-l 
TOI 



— Tio — ri2 



Tn-2,0 
Tn-2,1 



T l,n-2 



~'^n-2,0 



Tn-2,ri-3 ~ '^ri-2,n-l 



we notice that 

1. Each term in the sum of (132) is a product of n — 1 entries of M„ „, exactly one from 
each row and column. 

2. In a particular summand of (132), the sign of each monomial produced is the same 
and is determined by two things. First, each diagonal entry in the product forming 
the monomial contributes a power of —1. Second, the signum of the permutation 
involved could contribute an additional factor of —1. 

3. Diagonal entries being involved in a particular summand of (132) is equivalent to the 
permutation involved having a fixed point. 

4. The number of diagonal entries involved in a particular summand of (132) determines 
how many monomials the summand contains. In particular, if /(ex) denotes the num- 
ber of fixed points of the permutation cr, then the associated summand will contain 
{n — ly^'^^ monomials. 

The assumption we now make, albeit possibly incorrect, is that monomials of oppo- 
site sign will cancel to leave only monomials with identical sign. In this case, the four 
observations above imply 



The factor of (—1)"^^ in (133) ensures TZ is always positive. If n is even, then the mono- 
mials comprising v, as determined by (115), have negative coefficients. This fact is seen 
by noting that the majority of monomials come from the summand associated with the 
identity permutation — ^the summand in (132) that multiplies all diagonal elements. Since 
the identity permutation has positive signum, the sign of the monomials in this summand 
is entirely determined by n. If n is even, then n — 1 is odd, so the sign of these monomials 
is (— l)"""*^, as there are n — 1 diagonal entries. 

Provided our assumption is correct, proving Conjecture 1 is now reduced to showing 
the right-hand side of (133) is equal to The sum in (133) may be easier to manipulate 
if permutations were grouped by cycle structure. 




(133) 
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D The Growth of Cumulative Power 



In Quinn (2011), the author shows that, for finite sums of sines and cosines, cumulative 
power (28) is O (t). Here we prove the same growth relationship for the cumulative power 
(29) of the donkey's position, x, in the one-dimensional case. 

Theorem 2. // x obeys (1) and (2) and switches between the two states by the Markov 
process (3), then its cumulative power (29) is O (t). 

Proof. Notice from (34) and (35) that 

0<{x"{s)y<v\ (134) 

since 

< x{s) < 1 (135) 
for all time s, under the assumption < x{0) < 1. Thus, we have 

F{t)^ f\x"{s)f ds+ f\x"{s)f ds + ---+ f" {x"(s)y ds+ f{x"{s)Y ds (136) 

JQ Jtl Jt„-1 Jtn 

rti pt2 rtn rt 

< ds+ ds + ---+ ds+ ds (137) 

Jo Jtl Jt„-i Jt„ 

= [(tl - 0) + (t2 - tl) + ■ ■ ■ + {tn - tn-l) + (t - tn)] (138) 
= vH, (139) 

for all time t. Since is constant and cumulative power F is always positive, (139) shows 
FeC(t). □ 
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